Predicting PM2.5 Concentration in NYC from Measured Components

Background

PM2.5 refers to airborne particles with diameters less than 2.5 micrometers. These particles are small enough to enter the human circulatory system and can therefore affect human health more seriously than larger particles such as PM10. Many epidemiological studies have found that elevated PM2.5 concentrations are associated with a number of adverse health effects, such as cardiovascular and respiratory diseases and lung cancer. PM2.5 is currently one of the major air pollutants in many countries and regions, and it is a particularly serious problem in urban areas. In addition, PM2.5 leads to a series of environmental effects, including visibility impairment, environmental damage and material damage.

PM2.5 is in fact a very complex mixture of chemical species, water droplets and even microbes attached to the particles. The composition of PM2.5 (i.e. the proportions of these components) is closely related to its physical-chemical properties and toxicity. It has therefore been a major focus for researchers seeking to better understand the effects of PM2.5 on human health. The composition is also a fingerprint of pollution sources, and scientists have been using it to trace and quantify the major emission sources.

The United States Environmental Protection Agency (EPA) has established multiple nationwide monitoring networks[1] to measure PM2.5 concentration and its components in support of air quality management, air quality research and public health studies in the United States. However, because of cost considerations and the limitations of analytical techniques, the EPA routinely measures only a subset of the major components: organic carbon (OC), elemental carbon (EC), inorganic ions (sulfate, nitrate, and ammonium), and major elements (mineral elements, salt, and others). Since oxygen (O) and hydrogen (H) are not directly measured in these networks, researchers have been exploring various equations to account for their presence and thereby approximate the gravimetric mass of PM2.5. These equations predominantly take the form of linear equations with the components as independent variables and PM2.5 concentration as the dependent variable[2]. In air quality research, the linear coefficients are usually inferred from both the possible chemical forms of these components (e.g. iron could be present in nature as Fe3O4) and the measured data (e.g. by regression).
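As a sketch of the general linear form described above (the symbols are notation introduced here for illustration, not taken from the cited review), the reconstructed mass can be written as:

```latex
\mathrm{PM}_{2.5} = \beta_0 + \sum_{i=1}^{p} \beta_i \, x_i + \varepsilon
```

Here x_i is the measured concentration of component i, beta_i is the multiplier that accounts for unmeasured mass (such as the oxygen bound to an element), beta_0 is an intercept of the kind fitted by the regression models in this report, and epsilon is the residual between the reconstructed and gravimetric mass.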

The objective of this project is to use the measurements (both PM2.5 and its components) collected by the US nationwide Chemical Speciation Network (CSN) to explore regression models that predict PM2.5 concentration from the composition of PM2.5. This project also aims to evaluate whether purely mathematical approaches can reconstruct/predict PM2.5 concentrations using all or a subset of the component concentrations.

Data collection and cleaning

The raw data files were retrieved from the US EPA air quality monitoring data repository[3]. The 24-hour aggregated PM2.5 and component concentrations were collected as part of the US Chemical Speciation Network (CSN). The CSN comprises over 100 monitoring stations across the US. Each air monitoring station measures water-soluble ions, nitrate, sulfate, organic carbon (OC), elemental carbon (EC), elements in PM2.5 and meteorological conditions (temperature, wind direction, wind speed, relative humidity, etc.). Sampling occurs once every 3 days or once every 6 days. For this project, we selected data collected at a monitoring station located in New York City from 2011 to 2014.

The data collection and cleaning procedure included the following steps:
1) Retrieve the raw data files for PM2.5 and its components for the years 2011 to 2014.
2) Identify the station and sampler codes to extract the data for New York City from these raw data files.
3) Keep only the subset of variables that are relevant to this analysis.
4) Examine the missing values and keep only days that have data for all the variables.
5) Export the formatted and cleaned dataset to a new set of csv files. Data for 2011 and 2012 (trainingset.csv) were used to train the regression models, and data for 2013 and 2014 (testingset.csv) were used to test them.
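The complete-case filter and the split by year described in the steps above can be sketched as follows. This is a minimal illustration in Python, not the script used for this report: the column names, the tiny inline table and the helper names `is_complete` and `split_by_year` are all hypothetical.

```python
import csv
import io

# Hypothetical miniature of the extracted NYC data: one row per sampling day.
# Column names and values are illustrative only, not taken from the EPA files.
RAW_CSV = """Date,OC,EC,Sulfate,PM2.5
2011-01-03,2.94,1.40,2.58,16.6
2012-06-11,2.77,,2.33,12.7
2013-02-07,3.35,1.20,3.42,18.9
2014-08-19,2.50,1.23,2.65,
2014-11-02,3.47,2.08,3.35,16.7
"""

def is_complete(row):
    """Return True when every field of the row holds a value."""
    return all(value.strip() != "" for value in row.values())

def split_by_year(rows, training_years, testing_years):
    """Drop incomplete days, then assign each remaining day to a set by year."""
    training, testing = [], []
    for row in rows:
        if not is_complete(row):
            continue  # keep only days that have data for all variables
        year = int(row["Date"][:4])
        if year in training_years:
            training.append(row)
        elif year in testing_years:
            testing.append(row)
    return training, testing

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
training, testing = split_by_year(rows, {2011, 2012}, {2013, 2014})
print(len(training), len(testing))
```

Splitting by calendar year rather than at random keeps the testing days strictly later than the training days, which matches the training and testing files described above.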

The cleaning and formatting process dramatically reduced the size of the datasets. For example, the original raw data files are over 700 MB and contain over 2 million rows, whereas the compiled and cleaned datasets for analysis are typically around 30 KB and contain only about 160 rows.

Analysis

Exploratory data analysis

Before evaluating the regression models, we performed an exploratory analysis of the training dataset to understand the characteristics of the variables.

## 'data.frame':    166 obs. of  38 variables:
##  $ Antimony.  : num  0 0 0.011 0.033 0.009 0.036 0 0.005 0.016 0.021 ...
##  $ Arsenic.   : num  0.002 0 0 0 0.0026 0 0.001 0 0.002 0 ...
##  $ Aluminum.  : num  0.031 0.028 0.001 0.026 0.032 0.031 0.013 0.026 0.007 0.043 ...
##  $ Barium.    : num  0 0 0.009 0 0 0 0 0.009 0 0 ...
##  $ Bromine.   : num  0.0025 0.0032 0.0041 0.0029 0.0035 0.0036 0.0043 0.0075 0.0026 0.0061 ...
##  $ Cadmium.   : num  0 0 0.011 0 0 0 0 0.019 0.012 0.002 ...
##  $ Calcium.   : num  0.0587 0.0426 0.0437 0.0621 0.123 0.0424 0.0631 0.111 0.0571 0.135 ...
##  $ Chromium.  : num  0.001 0.001 0.002 0.002 0.002 0 0.001 0 0 0 ...
##  $ Cobalt.    : num  0.001 0.0023 0.001 0.0016 0.001 0 0.001 0 0.001 0.001 ...
##  $ Copper.    : num  0.002 0.003 0.007 0.0032 0.0395 0.001 0.008 0.009 0 0.0071 ...
##  $ Chlorine.  : num  0.0497 0.03 0.006 0.0301 0.109 0.008 0.063 0.044 0.106 0.16 ...
##  $ Cerium.    : num  0 0 0 0 0 0 0 0 0.002 0 ...
##  $ Cesium.    : num  0 0 0.002 0.004 0 0 0 0.001 0 0 ...
##  $ Iron.      : num  0.105 0.0719 0.0713 0.214 0.136 0.0869 0.115 0.282 0.0641 0.2 ...
##  $ Lead.      : num  0.002 0.004 0.005 0.0063 0.0102 0.001 0.0077 0.003 0 0.0028 ...
##  $ Indium.    : num  0.013 0.007 0.012 0.011 0 0 0 0.015 0.004 0 ...
##  $ Manganese. : num  0.001 0.0028 0.0034 0.0027 0.0043 0.001 0.0049 0.0065 0 0.005 ...
##  $ Nickel.    : num  0.0082 0.0045 0.0063 0.008 0.014 0.004 0.0051 0.0062 0.0043 0.0157 ...
##  $ Magnesium. : num  0 0 0.024 0 0 0 0 0 0 0 ...
##  $ Phosphorus.: num  0 0 0.003 0 0.009 0 0 0 0.005 0.016 ...
##  $ Selenium.  : num  0 0 0.001 0 0.002 0 0.001 0.001 0.001 0 ...
##  $ Tin.       : num  0 0.03 0 0 0 0 0 0.011 0.013 0 ...
##  $ Titanium.  : num  0 0.003 0.003 0.002 0 0 0.001 0.0088 0 0.004 ...
##  $ Vanadium.  : num  0.002 0 0.0064 0.001 0.0132 0.001 0.006 0.0116 0.001 0.0061 ...
##  $ Silicon.   : num  0.029 0.02 0.0355 0.032 0.044 0.023 0.086 0.0943 0.0378 0.0632 ...
##  $ Silver.    : num  0 0 0.016 0 0 0 0 0.011 0.011 0.008 ...
##  $ Zinc.      : num  0.033 0.0287 0.031 0.0416 0.0771 0.0175 0.0246 0.032 0.011 0.0661 ...
##  $ Strontium. : num  0.001 0.001 0.001 0 0.001 0 0 0 0.001 0 ...
##  $ Rubidium.  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Zirconium. : num  0 0.0064 0 0 0 0 0 0.004 0.003 0 ...
##  $ Ammonium   : num  2.06 1.32 2.92 1.68 1.74 1.32 3.16 2.88 0.295 1.3 ...
##  $ Sodium     : num  0.05 0.07 0.07 0.09 0.08 0.04 0.07 0.18 0.224 0.196 ...
##  $ Potassium  : num  0.074 0.079 0.083 0.058 0.074 0 0.083 0.134 0.034 0.064 ...
##  $ Nitrate    : num  4.34 2.35 5.24 2.9 2.43 1.38 5.99 4.82 0.554 2.68 ...
##  $ OC         : num  2.94 2.77 3.35 2.5 3.47 2.05 3.48 6.49 2.44 3.47 ...
##  $ EC         : num  1.4 1.16 1.2 1.23 2.08 1.07 1.17 3.34 0.525 1.55 ...
##  $ Sulfate.   : num  2.58 2.33 3.42 2.65 3.35 2.77 2.79 3.01 0.834 1.39 ...
##  $ PM2.5      : num  16.6 12.7 18.9 13.7 16.7 9.7 19.7 27 6.2 13.9 ...
## [1] "Summary Statistics of the training dataset"
##    Antimony.           Arsenic.           Aluminum.      
##  Min.   :0.000000   Min.   :0.0000000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.0000000   1st Qu.:0.00300  
##  Median :0.000000   Median :0.0000000   Median :0.01500  
##  Mean   :0.007765   Mean   :0.0004341   Mean   :0.02096  
##  3rd Qu.:0.011750   3rd Qu.:0.0010000   3rd Qu.:0.03075  
##  Max.   :0.072000   Max.   :0.0040000   Max.   :0.12500  
##     Barium.            Bromine.           Cadmium.          Calcium.      
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.002000   1st Qu.:0.00000   1st Qu.:0.03335  
##  Median :0.000000   Median :0.002800   Median :0.00000   Median :0.04780  
##  Mean   :0.001121   Mean   :0.003106   Mean   :0.00194   Mean   :0.06319  
##  3rd Qu.:0.000000   3rd Qu.:0.003975   3rd Qu.:0.00075   3rd Qu.:0.06298  
##  Max.   :0.018000   Max.   :0.018300   Max.   :0.02200   Max.   :1.50000  
##    Chromium.           Cobalt.             Copper.        
##  Min.   :0.000000   Min.   :0.0000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.0000000   1st Qu.:0.002600  
##  Median :0.001000   Median :0.0010000   Median :0.004200  
##  Mean   :0.002037   Mean   :0.0007235   Mean   :0.004973  
##  3rd Qu.:0.002000   3rd Qu.:0.0010000   3rd Qu.:0.006675  
##  Max.   :0.042700   Max.   :0.0029000   Max.   :0.039500  
##    Chlorine.          Cerium.             Cesium.            Iron.        
##  Min.   :0.00000   Min.   :0.0000000   Min.   :0.00000   Min.   :0.00210  
##  1st Qu.:0.00400   1st Qu.:0.0000000   1st Qu.:0.00000   1st Qu.:0.08712  
##  Median :0.00985   Median :0.0000000   Median :0.00000   Median :0.11600  
##  Mean   :0.03372   Mean   :0.0001084   Mean   :0.00112   Mean   :0.13101  
##  3rd Qu.:0.02210   3rd Qu.:0.0000000   3rd Qu.:0.00100   3rd Qu.:0.16475  
##  Max.   :1.56000   Max.   :0.0030000   Max.   :0.01000   Max.   :0.32100  
##      Lead.             Indium.           Manganese.      
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.001000  
##  Median :0.002000   Median :0.000000   Median :0.002100  
##  Mean   :0.002072   Mean   :0.003187   Mean   :0.002319  
##  3rd Qu.:0.003000   3rd Qu.:0.001750   3rd Qu.:0.003300  
##  Max.   :0.012400   Max.   :0.041000   Max.   :0.008000  
##     Nickel.           Magnesium.        Phosphorus.      
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.002225   1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.003250   Median :0.000000   Median :0.000000  
##  Mean   :0.004195   Mean   :0.006387   Mean   :0.001157  
##  3rd Qu.:0.005000   3rd Qu.:0.008000   3rd Qu.:0.000000  
##  Max.   :0.015700   Max.   :0.122000   Max.   :0.072000  
##    Selenium.              Tin.            Titanium.       
##  Min.   :0.0000000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.0000000   1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.0000000   Median :0.000000   Median :0.002000  
##  Mean   :0.0004181   Mean   :0.002904   Mean   :0.002229  
##  3rd Qu.:0.0010000   3rd Qu.:0.000000   3rd Qu.:0.003000  
##  Max.   :0.0032000   Max.   :0.046000   Max.   :0.009700  
##    Vanadium.           Silicon.          Silver.             Zinc.        
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.001000   1st Qu.:0.03000   1st Qu.:0.000000   1st Qu.:0.01100  
##  Median :0.003000   Median :0.04610   Median :0.000000   Median :0.01760  
##  Mean   :0.004282   Mean   :0.06207   Mean   :0.001619   Mean   :0.02242  
##  3rd Qu.:0.006100   3rd Qu.:0.06928   3rd Qu.:0.000000   3rd Qu.:0.02675  
##  Max.   :0.025300   Max.   :1.36000   Max.   :0.018000   Max.   :0.13000  
##    Strontium.          Rubidium.           Zirconium.       
##  Min.   :0.0000000   Min.   :0.0000000   Min.   :0.0000000  
##  1st Qu.:0.0000000   1st Qu.:0.0000000   1st Qu.:0.0000000  
##  Median :0.0000000   Median :0.0000000   Median :0.0000000  
##  Mean   :0.0006488   Mean   :0.0001723   Mean   :0.0007681  
##  3rd Qu.:0.0010000   3rd Qu.:0.0000000   3rd Qu.:0.0000000  
##  Max.   :0.0110000   Max.   :0.0020000   Max.   :0.0093000  
##     Ammonium          Sodium          Potassium          Nitrate      
##  Min.   :0.0000   Min.   :0.01630   Min.   :0.00000   Min.   :0.0952  
##  1st Qu.:0.4482   1st Qu.:0.04000   1st Qu.:0.00915   1st Qu.:0.6110  
##  Median :0.7895   Median :0.07000   Median :0.02700   Median :1.1800  
##  Mean   :1.0200   Mean   :0.09725   Mean   :0.03744   Mean   :1.6760  
##  3rd Qu.:1.3200   3rd Qu.:0.11300   3rd Qu.:0.05058   3rd Qu.:2.0075  
##  Max.   :3.9400   Max.   :1.31000   Max.   :0.30400   Max.   :8.1200  
##        OC              EC            Sulfate.         PM2.5       
##  Min.   :0.029   Min.   :0.0000   Min.   :0.058   Min.   : 2.900  
##  1st Qu.:2.042   1st Qu.:0.7615   1st Qu.:1.250   1st Qu.: 7.525  
##  Median :2.535   Median :1.0700   Median :1.820   Median :10.350  
##  Mean   :2.820   Mean   :1.2022   Mean   :2.198   Mean   :11.777  
##  3rd Qu.:3.365   3rd Qu.:1.5100   3rd Qu.:2.725   3rd Qu.:14.575  
##  Max.   :6.490   Max.   :3.6600   Max.   :9.600   Max.   :29.500
## [1] "Summary of standard deviations of each variable"
##    Antimony.     Arsenic.  Aluminum.     Barium.    Bromine.   Cadmium.
## 1 0.01351449 0.0008385659 0.02242708 0.002729698 0.002107611 0.00437167
##    Calcium.   Chromium.      Cobalt.     Copper. Chlorine.      Cerium.
## 1 0.1182131 0.003887062 0.0007453223 0.004054483 0.1329741 0.0004267438
##       Cesium.      Iron.       Lead.     Indium.  Manganese.     Nickel.
## 1 0.002201364 0.06011016 0.002371596 0.006902442 0.001750726 0.003122156
##   Magnesium. Phosphorus.    Selenium.        Tin.   Titanium.   Vanadium.
## 1 0.01368735 0.006294744 0.0006866951 0.007416794 0.002081085 0.004761376
##    Silicon.    Silver.      Zinc.  Strontium.    Rubidium.  Zirconium.
## 1 0.1079582 0.00388983 0.01751811 0.001357482 0.0004165699 0.001863884
##    Ammonium    Sodium  Potassium  Nitrate       OC        EC Sulfate.
## 1 0.8129256 0.1230461 0.04183482 1.531039 1.076872 0.6116432 1.383489
##      PM2.5
## 1 5.736485

We also visualized the relative contribution of each component to the total PM2.5 concentration in the following bar chart. There are 37 components measured and reported by the EPA in this dataset (the 38th variable is PM2.5 itself). From the bar chart, we see that organic carbon, sulfate, nitrate, elemental carbon and ammonium account for about 75% of total PM2.5. The other species account for only about one quarter, with many elements contributing a negligible amount of PM2.5 mass.
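The roughly 75% share can be checked against the mean concentrations printed in the summary statistics. The short Python sketch below uses the ratio of mean concentrations, which is an assumption made here; the bar chart may have been built from a different aggregate.

```python
# Mean concentrations copied from the summary statistics of the training dataset.
mean_concentration = {
    "OC": 2.820,
    "Sulfate": 2.198,
    "Nitrate": 1.676,
    "EC": 1.2022,
    "Ammonium": 1.0200,
}
mean_pm25 = 11.777

# Share of the mean PM2.5 mass carried by the five major components.
major_share = sum(mean_concentration.values()) / mean_pm25
print(f"{major_share:.1%}")  # prints 75.7%
```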

The time series of PM2.5 and its selected key components show seasonal variations. For example, PM2.5 concentration has two peaks in summer and winter months. Sulfate tends to be more abundant in the air in the summer whereas nitrate concentration primarily peaks during cold seasons.

Further investigation of the distributions of these variables in the training dataset suggests that many elements are present at trace levels and are frequently reported as zero because their concentrations are below the detection limit. The concentrations of the other major components, as well as PM2.5, appear to follow lognormal-like distributions.

Subsetting

We used subsetting (subset selection) to look for the best subset of predictors. Forward, backward and hybrid methods were explored in this analysis.

For example, the following graphs visualize the results from the forward subsetting method.

## [1] "Forward subsetting"

The backward and hybrid methods generated similar outputs and are therefore not shown in this report. The following table compares the number of variables in the best-fit model by method and selection criterion. The methods were generally consistent with one another; the main disagreement in the number of variables comes from the choice of criterion. For example, using BIC as the criterion reduces the number of predictors in the best-fit model to 8 or 9.

## [1] "Comparison of the number of variables in the best-fit model"
method     Cp   BIC   adjr2
forward    16     9      18
backward   15     8      17
hybrid     15     8      17

Because evaluating every possible “best-fit” model suggested by our analysis would be complex, we chose the hybrid model with 8 variables as an example and further evaluated its performance. This model identifies aluminum, calcium, vanadium, zirconium, nitrate, OC, EC and sulfate as the primary predictors[4].

Multivariate linear regression

We first built a simple linear model using the lm function.

## 
## Call:
## lm(formula = PM2.5 ~ ., data = training.clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5467 -0.9236 -0.0770  0.7302  6.9903 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.05432    0.72262  -0.075 0.940201    
## Antimony.      0.10231   13.18986   0.008 0.993823    
## Arsenic.      79.09069  223.09004   0.355 0.723530    
## Aluminum.    -18.70193   10.31103  -1.814 0.072054 .  
## Barium.       52.10580   76.94355   0.677 0.499504    
## Bromine.     -36.94899  107.40413  -0.344 0.731397    
## Cadmium.      21.36963   48.87916   0.437 0.662708    
## Calcium.       6.75570    8.83898   0.764 0.446091    
## Chromium.    -54.22773   55.28679  -0.981 0.328520    
## Cobalt.     -452.94674  258.82612  -1.750 0.082514 .  
## Copper.       24.94005   59.57917   0.419 0.676207    
## Chlorine.      2.88164    2.13684   1.349 0.179863    
## Cerium.     -297.94003  398.75607  -0.747 0.456329    
## Cesium.      109.38041   77.05210   1.420 0.158165    
## Iron.         -6.07475    4.64218  -1.309 0.193016    
## Lead.        -43.39213   85.14568  -0.510 0.611194    
## Indium.      -12.94100   25.28986  -0.512 0.609738    
## Manganese.   266.42961  132.33635   2.013 0.046183 *  
## Nickel.      -55.03376  114.02669  -0.483 0.630176    
## Magnesium.    -9.71286   19.10295  -0.508 0.612014    
## Phosphorus.   -3.06297   45.09807  -0.068 0.945957    
## Selenium.     -0.41866  271.13067  -0.002 0.998770    
## Tin.         -36.47231   22.65694  -1.610 0.109913    
## Titanium.    -80.04393   92.65108  -0.864 0.389243    
## Vanadium.   -117.99091   52.34268  -2.254 0.025883 *  
## Silicon.       1.02088    9.62033   0.106 0.915656    
## Silver.      -52.72089   50.85873  -1.037 0.301870    
## Zinc.         13.72810   24.55020   0.559 0.577013    
## Strontium.    38.36922  126.47089   0.303 0.762090    
## Rubidium.    180.64266  445.45261   0.406 0.685768    
## Zirconium.   474.40053   90.49812   5.242 6.36e-07 ***
## Ammonium       0.39389    1.23842   0.318 0.750961    
## Sodium         0.06818    2.64351   0.026 0.979465    
## Potassium     -0.36520    4.49301  -0.081 0.935345    
## Nitrate        0.80154    0.41880   1.914 0.057864 .  
## OC             1.76803    0.33151   5.333 4.23e-07 ***
## EC             1.82927    0.63687   2.872 0.004772 ** 
## Sulfate.       1.66945    0.45590   3.662 0.000365 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.897 on 128 degrees of freedom
## Multiple R-squared:  0.9152, Adjusted R-squared:  0.8907 
## F-statistic: 37.34 on 37 and 128 DF,  p-value: < 2.2e-16

A first look at the results indicates that the linear regression is far from ideal.
1) Many of the estimated coefficients are negative, whereas all coefficients would have to be positive to make practical sense, because the independent variables are components of PM2.5 and represent concentrations in the air.
2) The standard errors associated with many of the coefficient estimates are large.
3) A large portion of the p-values are fairly large, suggesting that the corresponding coefficients are NOT statistically distinguishable from 0.
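The gap between the multiple and adjusted R-squared in the summary reflects the penalty for fitting 37 predictors to 166 observations. As a small worked check in Python (the function name is hypothetical; the inputs are the values printed in the summary), the standard adjustment reproduces the reported figure:

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# 166 observations and 37 predictors leave 128 residual degrees of freedom.
adjusted = adjusted_r_squared(0.9152, n_obs=166, n_predictors=37)
print(round(adjusted, 4))  # prints 0.8907
```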

The variance inflation factor (VIF) is equal to 11.792711, which is much greater than 2 (a level often used as a cutoff for lm regression), suggesting a strong impact of multicollinearity in this dataset (recall that the absence of multicollinearity is one of the assumptions of the linear model).
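The VIF formula is 1 / (1 - R^2). In its usual per-predictor form, R^2 comes from regressing one predictor on all the others. The reported value appears to be consistent with applying the same formula to the multiple R-squared of the full model; this is an observation from the printed numbers, not a confirmed description of how the value was computed. A minimal Python sketch:

```python
def variance_inflation_factor(r_squared):
    """VIF = 1 / (1 - R^2) for the supplied R-squared."""
    return 1.0 / (1.0 - r_squared)

# Multiple R-squared of the full linear model, as printed in its summary
# (rounded to four decimals, so the result matches only approximately).
vif_from_model_fit = variance_inflation_factor(0.9152)
print(round(vif_from_model_fit, 2))  # prints 11.79
```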

It is not surprising that many components are inter-correlated. For example, silicon and calcium have a correlation of 0.9726204. The practical explanation is that both silicon and calcium are “crustal elements”, meaning that they both predominantly enter PM2.5 from soil/dust that is resuspended by natural and human activities. These correlations are very informative in environmental science research but could seriously harm the linear regression model we are investigating here.
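For readers who want to see how such a correlation is obtained, the sketch below computes a Pearson correlation in plain Python. It uses only the first ten calcium and silicon values printed in the structure listing of the training dataset, so its result is not expected to match the 0.9726204 computed from all 166 observations.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# First ten observations only, copied from the str() listing of the training set.
calcium = [0.0587, 0.0426, 0.0437, 0.0621, 0.123, 0.0424, 0.0631, 0.111, 0.0571, 0.135]
silicon = [0.029, 0.02, 0.0355, 0.032, 0.044, 0.023, 0.086, 0.0943, 0.0378, 0.0632]

r_first_ten = pearson_correlation(calcium, silicon)
print(round(r_first_ten, 3))
```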

Principal Component Regression

With cross-validation, principal component regression identified that 5 components were able to explain over 80% of the variation in PM2.5. Increasing the number of components only marginally increases the R2 and reduces the mean squared error of prediction (MSEP). Therefore, for the purpose of dimension reduction, the regression model with 5 components was selected for further evaluation in this project.

##   Antimony.    Arsenic.   Aluminum.     Barium.    Bromine.    Cadmium. 
## -0.02954041  0.15975805  0.50783592  0.08421147  0.52201557 -0.13195551 
##    Calcium.   Chromium.     Cobalt.     Copper.   Chlorine.     Cerium. 
## -0.01166957 -0.05574703 -0.02294926  0.37928667  0.03053707 -0.31483617 
##     Cesium.       Iron.       Lead.     Indium.  Manganese.     Nickel. 
## -0.24889491  0.29354575  0.17326462  0.04186939  0.32290633  0.19987074 
##  Magnesium. Phosphorus.   Selenium.        Tin.   Titanium.   Vanadium. 
## -0.05167313 -0.32315997  0.20791993  0.01516300  0.21623966  0.60070544 
##    Silicon.     Silver.       Zinc.  Strontium.   Rubidium.  Zirconium. 
##  0.09264434 -0.12441157 -0.07383630  0.08561231  0.21078614  0.16132446 
##    Ammonium      Sodium   Potassium     Nitrate          OC          EC 
##  0.72988855  0.03843268  0.45430156  0.39156287  0.83763790  0.75335539 
##    Sulfate. 
##  0.84164923

Ridge regression

## [1] 1.5905
## 38 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)    1.2327931
## Antimony.     -0.3693021
## Arsenic.     146.8439285
## Aluminum.     -1.9840391
## Barium.       72.6500479
## Bromine.     106.0231000
## Cadmium.      -6.2525330
## Calcium.       2.2455402
## Chromium.    -48.6316988
## Cobalt.     -291.6304553
## Copper.       37.8671311
## Chlorine.      2.3190229
## Cerium.     -392.6371290
## Cesium.       50.8906724
## Iron.         -1.2172543
## Lead.        -19.9090654
## Indium.      -12.7458894
## Manganese.   215.6197469
## Nickel.      -10.6526738
## Magnesium.    -4.6040578
## Phosphorus.   -6.2305396
## Selenium.    130.8275639
## Tin.         -17.1925657
## Titanium.    -20.8652995
## Vanadium.    -26.7817509
## Silicon.       2.4149033
## Silver.      -49.3633590
## Zinc.          8.6812635
## Strontium.    56.8969152
## Rubidium.    286.5269589
## Zirconium.   374.1075373
## Ammonium       1.3369798
## Sodium        -0.5227367
## Potassium      3.4606923
## Nitrate        0.3988884
## OC             1.1647456
## EC             1.5981218
## Sulfate.       0.9270153

Examining the coefficients of the ridge regression model shows that many variables have negative and/or very large coefficients. As discussed in the previous section, negative coefficients are not desirable in practice. In addition, large coefficients indicate that the model places considerable weight on the corresponding variables, which may be alarming in some cases (especially for trace elements).

Lasso regression

As we have seen, many components are inter-correlated, at least to some extent. This may prevent effective modeling using multivariate regression. Lasso regression is able to aggressively reduce dimensionality by pushing many of the coefficients to zero, which may be helpful in the presence of variables that are frequently zero or close to zero.
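The mechanism that sets coefficients exactly to zero can be illustrated with the soft-thresholding operator, which appears in coordinate-descent solutions of the lasso when predictors are standardized. The Python sketch below shows the operator only; it is not the fitting routine used for this report, and the input values are made up.

```python
def soft_threshold(z, penalty):
    """Shrink z toward zero by `penalty`; values within the penalty become 0."""
    if z > penalty:
        return z - penalty
    if z < -penalty:
        return z + penalty
    return 0.0

# Made-up unpenalized coefficient estimates, for illustration only.
unpenalized = [3.0, -2.5, 0.4, -0.1]
shrunk = [soft_threshold(z, 1.0) for z in unpenalized]
print(shrunk)  # prints [2.0, -1.5, 0.0, 0.0]
```

Coefficients whose unpenalized magnitude is below the penalty are removed entirely, which is why a lasso fit can end up with far fewer predictors than a ridge fit.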

## [1] 0.3195685
## 38 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)   1.07442779
## Antimony.     .         
## Arsenic.      .         
## Aluminum.     .         
## Barium.       .         
## Bromine.      .         
## Cadmium.      .         
## Calcium.      3.04636341
## Chromium.     .         
## Cobalt.       .         
## Copper.       .         
## Chlorine.     0.56430824
## Cerium.       .         
## Cesium.       .         
## Iron.         .         
## Lead.         .         
## Indium.       .         
## Manganese.  121.55395982
## Nickel.       .         
## Magnesium.    .         
## Phosphorus.   .         
## Selenium.     .         
## Tin.          .         
## Titanium.     .         
## Vanadium.     .         
## Silicon.      .         
## Silver.       .         
## Zinc.         .         
## Strontium.    .         
## Rubidium.     .         
## Zirconium.  202.06559077
## Ammonium      2.41793184
## Sodium        .         
## Potassium     .         
## Nitrate       0.02448812
## OC            1.51867150
## EC            1.55921202
## Sulfate.      0.63219105

The results from the Lasso regression suggest that only 9 (out of 37) variables (components) end up having non-zero coefficients.

Results and discussion

model        RMSE    R^2
lasso        1.74    0.8753
ridge        1.979   0.8321
subsetting   2.122   0.8239
lm           2.308   0.792
pcr          2.391   0.7506
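The two metrics in the table can be computed from observed and predicted concentrations as in the following Python sketch. The numbers are made up for illustration, and R^2 is taken here as 1 - SS_res / SS_tot; the report may instead have used the squared correlation between observed and predicted values.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration only, not taken from the testing dataset.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 11.0, 15.0, 15.0]

print(rmse(observed, predicted))                 # prints 1.0
print(round(r_squared(observed, predicted), 4))  # prints 0.8
```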

Based on the results, Lasso regression yielded the best fit and prediction, while the principal component regression model exhibited the largest RMSE. In general, all five evaluated models showed decent to good capability of making predictions on the testing dataset. A close look at the coefficients generated by the Lasso regression shows that the components retained are those that account for a large portion of the PM2.5 mass, as well as those commonly correlated with other components. For example, sulfate, nitrate, OC, EC and ammonium are the species accounting for about three quarters of the PM2.5 mass. Chlorine is correlated with several sea salt elements (sodium, magnesium, etc.), and calcium is correlated with mineral species such as silicon and iron. To some extent, we can even identify some major sources of PM2.5 from this shorter list of variables.

Considering the additional benefit of dimension reduction by Lasso regression (9 variables in the final model), it remains the optimal choice. The Lasso coefficients for manganese and zirconium are large; this might be because their concentrations are close to zero, which causes larger uncertainties.
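One way to put the large coefficients in context is to multiply each coefficient by the standard deviation of its variable, which gives the change in predicted PM2.5 for a one-standard-deviation change in that component. The Python sketch below does this with the Lasso coefficients and training-set standard deviations printed earlier; treating this product as a measure of practical weight is an interpretation added here, not part of the original analysis.

```python
# Lasso coefficients and training-set standard deviations copied from the
# outputs printed earlier in this report.
lasso_coefficient = {"Manganese": 121.55395982, "Zirconium": 202.06559077, "OC": 1.51867150}
standard_deviation = {"Manganese": 0.001750726, "Zirconium": 0.001863884, "OC": 1.076872}

# Change in predicted PM2.5 for a one-standard-deviation change in each variable.
effect_per_sd = {
    name: lasso_coefficient[name] * standard_deviation[name]
    for name in lasso_coefficient
}
for name, effect in effect_per_sd.items():
    print(f"{name}: {effect:.3f}")
```

On this scale the trace elements move the prediction by less than organic carbon does, even though their raw coefficients are about two orders of magnitude larger.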

References

  1. United States Environmental Protection Agency, Managing Air Quality - Ambient Air Monitoring, accessed Dec 4, 2017 at https://www.epa.gov/air-quality-management-process/managing-air-quality-ambient-air-monitoring
  2. Chow, J. C., Lowenthal, D. H., Chen, L.-W. A., Wang, X., Watson, J. G. (2015) Mass reconstruction methods for PM2.5: a review. Air Qual Atmos Health 8: 243-63
  3. United States Environmental Protection Agency, Pre-Generated Data Files, accessed Nov 10, 2017 at https://aqs.epa.gov/aqsweb/airdata/download_files.html
  4. The script for creating a predict function for regsubsets objects was written with reference to https://suclass.stanford.edu/asset-v1:Statistics+Stats216+Winter2017+type@asset+block/ch6.html